Machine Learning in Public Health
Lecture 1: What is Machine Learning?
Dr. Yang Feng
Today’s agenda
- Introduction of the instructor and the CAs
- Go over the syllabus
- From Statistics to Machine Learning
- Supervised Learning vs. Unsupervised Learning
- Assessing Model Accuracy
- Bias and Variance Trade-off
About the instructor
- Associate Professor of Biostatistics
- Ph.D. in Operations Research at Princeton University, 2010.
- Has published multiple academic papers on machine learning, so hopefully well qualified to teach it…
To learn more about me
Brilliant Course Assistants

Yuyu (Ruby) Chen is a doctoral student at NYU School of Global Public Health specializing in Biostatistics. She was involved in multiple collaborative projects and consulting tasks, including Bayesian Adaptive Platform clinical trials, meta-analysis, machine learning and several longitudinal/cohort studies during her time at NYU. She is interested in using data and novel methods to address public health issues and find optimal clinical solutions with statistical approaches.
Brilliant Course Assistants

Yu Meng is a second-year student in the MS Biostatistics program.
I’m currently working on the association between health care coverage and several physical health risk factors. I love traveling, and I love snow days~ Hope we can have a great semester!
Brilliant Course Assistants

Yuan Zhao is a third-year PhD student in Epidemiology. Her research mainly focuses on causal inference using the targeted maximum likelihood estimation (TMLE) framework and machine learning to predict hospital admissions. She is especially interested in using social determinants of health to improve algorithmic fairness and interpretability in health care data.
Brilliant Course Assistants

Jianan (Zoe) Zhu is currently a second-year MS Biostatistics student at GPH.
My research interest is machine learning with applications to public health. In my spare time, I like to enjoy various cuisines in NYC.
Let’s start from “Statistics”
- Statistics is an old term.
- Statistics is the study of the collection, analysis, interpretation, presentation, and organization of data
- Statistics is much more than counting crops, compiling baseball scores, tabulating life and death records, etc.!
- The Canadian philosopher of science Ian Hacking (1936-2023) captured the essence of Statistics:
- The quiet statisticians have changed our world - not by discovering new facts or technical developments but by changing the ways we reason, experiment and form our opinions about it.
21st century: big data!

Machine learning
Machine learning constructs algorithms that can learn from data, especially for prediction
Data Science is the extraction of knowledge from data, using ideas from mathematics, statistics, machine learning, computer programming, data engineering …
Other terms you encounter often: statistical learning, artificial intelligence (AI), deep learning, etc.
Structured vs. unstructured data
Structured data: a flat file with a fixed number of measurements per record, e.g., patient responses to a drug together with patient characteristics (such as age, weight, height, nutrition intake)
Unstructured data: doctor’s notes, Twitter feeds, broker reports
We focus on structured data in this course.
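A minimal sketch of what structured data looks like in R: a flat table with a fixed set of measurements per patient. All column names and values here are made up for illustration, not from a real study.

```r
# Structured data: every row (patient) has the same fixed set of fields
patients <- data.frame(
  age       = c(34, 58, 45),
  weight_kg = c(70, 85, 62),
  nutrition = c("high", "low", "medium"),
  response  = c(1, 0, 1)   # hypothetical drug response (1 = improved)
)
str(patients)  # a rectangular table: 3 observations of 4 variables
```

Unstructured data such as doctors’ notes would first need to be converted into a table like this before most of the methods in this course can be applied.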
Machine learning examples
Use classification techniques to classify which type of disease a patient has.
Use regression techniques to predict the BMI value using features related to living style.
Regression and classification are examples of Supervised Learning techniques (see next slide)
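As a small sketch on simulated data (the variables and coefficients are invented for illustration), the same supervised-learning workflow in R changes mainly in the type of response: a continuous BMI calls for regression, while a binary disease label calls for classification.

```r
set.seed(1)
n <- 200
exercise <- runif(n, 0, 10)                        # hours of exercise per week (simulated)
bmi      <- 28 - 0.6 * exercise + rnorm(n)         # continuous response
disease  <- rbinom(n, 1, plogis(-2 + 0.1 * bmi))   # binary response

fit_reg <- lm(bmi ~ exercise)                      # regression: predict a number
fit_cls <- glm(disease ~ bmi, family = binomial)   # classification: predict a class probability

predict(fit_reg, data.frame(exercise = 5))                  # a predicted BMI value
predict(fit_cls, data.frame(bmi = 30), type = "response")   # estimated P(disease = 1)
```

Note that logistic regression returns a probability; to obtain a class label, we would threshold it (e.g., predict disease when the probability exceeds 0.5).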
Supervised learning paradigm

Supervised Learning (Training)

Supervised Learning (Prediction on Test Data)

First supervised learning example: diamond price prediction
- Task: predict diamond price based on weight (regression)

Classification or Regression?
Second example: cancer diagnosis (benign, malignant)
- This breast cancer dataset was obtained from the University of Wisconsin Hospitals, Madison, from Dr. William H. Wolberg (the class variable is coded 2 for benign and 4 for malignant).

Classification or Regression?
Last example: AI vs. doctors

Unsupervised Learning

Supervised vs. Unsupervised Learning
Supervised: Both inputs (features, a.k.a. covariates, a.k.a. independent variables) and outputs (labels, a.k.a. response, a.k.a. dependent variable) in training set.
Unsupervised: No output values available, just inputs.

Syllabus highlights (cont.)
What is this course about?
- We will cover a wide range of topics in machine learning.
- We will try to understand on an intuitive level why these algorithms work.
- This course is not a theory course, but a minimum of math/probability notation is unavoidable.
- When we learn an algorithm, we study not only how to run it in R, but also why it is a good fit for a given problem, how it is trained behind the scenes, etc. In other words, we strive for a comprehensive understanding of machine learning algorithms.
Achievements after taking this course
- Understand the most popular supervised and unsupervised machine learning algorithms:
- linear regression
- logistic regression
- \(K\)-Nearest Neighbors
- LASSO
- decision tree
- random forest
- support vector machines
- boosting
- deep learning
- Principal Component Analysis
- Clustering
- Be able to choose a proper algorithm and implement it for a given problem
- Understand the limits of machine learning algorithms
- Get (more) familiar with the language R
Structure of the Course
- Component 1: Lectures cover the fundamentals of machine learning methods
- Component 2: Lab sessions (last hour) for implementation in R
- Component 3: Review the corresponding sections in ISLR and practice the tutorials
My expectations
- Come to class on time and be ready to learn.
- Review contents after each lecture. Memory fades quickly if you do not refresh the contents.
- Submit homework and project on time.
- Come to office hours/make appointments if you have any questions
- The course load is on the heavy side, so be prepared to spend several hours a week in addition to attending the lectures
Motivating Example: Predicting Income

Predicting Income using Years of Education.
- \(X\): Years of Education, the predictor.
- \(Y\): Income, the response.
Example: Predicting Income (II)

Why Estimate \(f\)?
Goal No. 1: Prediction
- In many situations, a set of inputs \(X\) are readily available, but the output \(Y\) cannot be easily obtained. Then, for a new input \(X\), we can predict \(Y\) using \[\hat Y = \hat f(X).\]
- Given an estimate \(\hat f\) and a fixed \(X\), we can show that \[\begin{align}
E(Y - \hat Y)^2 &= E[f(X) + \epsilon - \hat f(X)]^2\\
&=E\{[f(X) - \hat f(X)]^2 + \epsilon^2 + 2\epsilon[f(X) - \hat f(X)]\}\\
&=[f(X) - \hat f(X)]^2 + Var(\epsilon)\\
&= \mbox{Reducible Error} + \mbox{Irreducible Error}
\end{align}\] (the cross term vanishes because \(E(\epsilon) = 0\) while \(f(X) - \hat f(X)\) is fixed, and \(E(\epsilon^2) = Var(\epsilon)\)).
- We will learn methods that minimize the Reducible Error.
- Note that the Irreducible Error (unknown in practice) always provides a lower bound on the prediction error.
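The decomposition above can be checked by simulation. In this sketch we pick an arbitrary true function \(f\) and noise level, then evaluate the prediction error of the *true* \(f\) itself: the reducible error is zero, so the test MSE settles at \(Var(\epsilon)\).

```r
set.seed(42)
f     <- function(x) sin(2 * x)   # hypothetical true regression function
sigma <- 0.5                      # sd of the irreducible noise epsilon

x <- runif(1e5)
y <- f(x) + rnorm(1e5, sd = sigma)

# Even predicting with the true f, the MSE cannot go below Var(epsilon)
mean((y - f(x))^2)   # close to sigma^2 = 0.25
```

Any estimate \(\hat f\) can only do worse than this on average, which is why \(Var(\epsilon)\) is a floor on the prediction error.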
Why Estimate \(f\)?
Goal No. 2: Inference
- Which predictors are associated with the response? Particularly important for high-dimensional data.
- What is the relationship between the response and each predictor? Positive vs. Negative? Joint Effects?
- Can the relationship between \(Y\) and each predictor be adequately summarized using a linear equation, or is the relationship more complicated?
How Do We Estimate \(f\)?
\[income \approx \beta_0 + \beta_1 \times education + \beta_2 \times seniority\] 
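One parametric way to estimate \(f\) is least squares, which in R is a one-line `lm()` call. The data below are simulated with made-up coefficients purely to show the mechanics; the real Income data in ISLR would be loaded from file instead.

```r
set.seed(7)
n <- 100
education <- runif(n, 10, 20)    # years of education (simulated)
seniority <- runif(n, 0, 150)    # years of seniority (simulated)
income    <- 20 + 3 * education + 0.2 * seniority + rnorm(n, sd = 5)

# Fit the linear model income ~ beta_0 + beta_1*education + beta_2*seniority
fit <- lm(income ~ education + seniority)
coef(fit)   # least-squares estimates of beta_0, beta_1, beta_2
```

With enough data, the estimated coefficients land close to the values used to generate the simulation, which is the sense in which least squares "estimates \(f\)" under a linear assumption.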
How Do We Estimate \(f\)?
Thin-plate spline: the fit can be smooth or rough, depending on how much flexibility we allow.


Flexibility vs. Interpretability

No Free Lunch Theorem
- The “no free lunch” (NFL) theorem for supervised machine learning states that no single machine learning algorithm is universally the best-performing algorithm across all problems.

Measuring Quality of Fit
- Mean Squared Error on Training Data:
\[MSE_{training} = \frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_i)^2,\]
- Imagine we have an \(\hat f\) such that \[\hat Y_i = \hat f(X_i) = Y_i, \] for all \(i= 1,\cdots,n\). Then \(MSE_{training} = 0\). Do you think this is a good idea?
- Probably NOT! The real objective is to predict the response for unobserved data (a.k.a. test data). Suppose we have a large number of test observations; then we could compute \[MSE_{test} = Ave[(y_0 - \hat f(x_0))^2],\] where the average is taken as \((x_0, y_0)\) ranges over all test observations.
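The gap between training and test MSE is easy to demonstrate by simulation. This sketch (true function and noise level chosen arbitrarily) compares a very flexible fit with a simpler one on fresh test data.

```r
set.seed(3)
n  <- 50
x  <- runif(n);    y  <- sin(2 * pi * x)  + rnorm(n, sd = 0.3)     # training data
x0 <- runif(1000); y0 <- sin(2 * pi * x0) + rnorm(1000, sd = 0.3)  # test data

mse <- function(fit, x, y) mean((y - predict(fit, data.frame(x = x)))^2)

flexible <- lm(y ~ poly(x, 20))  # very flexible: degree-20 polynomial
simple   <- lm(y ~ poly(x, 3))   # much simpler fit

mse(flexible, x, y)    # tiny training MSE: the flexible model chases the noise
mse(flexible, x0, y0)  # much larger test MSE: overfitting
mse(simple,   x0, y0)  # the simpler model generalizes better here
```

Driving the training MSE toward zero, as in the \(\hat f(X_i) = Y_i\) scenario above, is exactly what the degree-20 fit is doing, and the test MSE exposes the problem.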
MSE vs. Flexibility

Decompose the test MSE
For a test observation \(x_0\), we want to minimize the expected test MSE.
\[\begin{align}
E(y_0 - \hat f(x_0))^2 &= Var(\hat f(x_0)) + [Bias(\hat f(x_0))]^2 +
Var(\epsilon)
\end{align}\]
- Bias: \(E[\hat f(x_0) - f(x_0)]\)
- Variance: \(Var[\hat f(x_0)]\)
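The two terms can be estimated by simulation: refit the model on many fresh training sets and look at the spread and the average of \(\hat f(x_0)\). In this sketch (true function, test point, and model all chosen for illustration), a rigid quadratic fit has low variance but high bias at \(x_0\).

```r
set.seed(9)
f  <- function(x) sin(2 * pi * x)   # hypothetical true function
x0 <- 0.25                          # fixed test point, f(x0) = 1

# Refit a rigid quadratic model on 500 independent training sets
sims <- replicate(500, {
  x <- runif(50)
  y <- f(x) + rnorm(50, sd = 0.3)
  fit <- lm(y ~ poly(x, 2))
  predict(fit, data.frame(x = x0))
})

var(sims)                  # Var(hat f(x0)): small, the rigid fit is stable
(mean(sims) - f(x0))^2     # squared bias: large, a quadratic cannot match sin
```

A more flexible model would flip the picture: lower bias but higher variance. Balancing the two is the trade-off in the decomposition above.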
Bias and Variance Tradeoff
